Hi Naresh - I saved two crucial artifacts as part of FastText training in gensim, and the same techniques should apply to word2vec training in gensim as well.
First, I saved the word embeddings in the w2v text file format.
...
# save w2v format since this is useful for PyTorch
if save_w2v:
    w2v_out_filepath = os.path.join(save_dir, f'{file_name}_w2vformat.txt')
    model.wv.save_word2vec_format(w2v_out_filepath)
    print(f'Saved {w2v_out_filepath}')
...
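(For reference, the saved file is plain text: a header line with the vocabulary size and vector dimensionality, followed by one line per word containing the word and its space-separated vector values.)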
Second, I saved the word frequency Counter dictionary that torchtext.vocab.Vocab requires.
...
# save word frequencies since this is useful for PyTorch
# (model.wv.vocab is the gensim 3.x API: a dict mapping each word to a
#  Vocab object that carries its corpus count)
counts = Counter(
    {word: vocab.count
     for (word, vocab) in model.wv.vocab.items()})
freq_filepath = os.path.join(save_dir, f'{file_name}_word_freq.json')
save_json(freq_filepath, counts)
print(f'Saved {freq_filepath}')
...
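(save_json above is just a thin helper; something along these lines is all it needs to do, since Counter is a dict subclass and json.dump can serialize it directly.)

import json

def save_json(filepath, obj):
    # Counter is a dict subclass, so json.dump writes it as a plain JSON object
    with open(filepath, 'w') as f:
        json.dump(obj, f)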
Then I load these files into torchtext.vocab.Vectors and torchtext.vocab.Vocab objects, respectively, as described above, and it all comes together beautifully.
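In case the loading step is useful too, here's a rough sketch of how that side can look, assuming the legacy torchtext API (the pre-0.9 torchtext.vocab.Vectors/Vocab) and the save_dir/file_name variables from the snippets above. Note that Vectors logs a warning when it skips the w2v header line, which is expected.

import json
import os
from collections import Counter

from torchtext.vocab import Vectors, Vocab

# load the saved embeddings (the w2v header line is skipped with a warning)
vectors = Vectors(name=os.path.join(save_dir, f'{file_name}_w2vformat.txt'),
                  cache=save_dir)

# rebuild the word frequency Counter saved earlier
with open(os.path.join(save_dir, f'{file_name}_word_freq.json')) as f:
    counts = Counter(json.load(f))

# Vocab ties the frequencies and the pretrained vectors together;
# vocab.vectors ends up aligned with the vocab's indices
vocab = Vocab(counts, vectors=vectors)

From there, vocab.vectors can go straight into torch.nn.Embedding.from_pretrained.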
I hope that helps!